Lecture 4: Visualization

EC7412 Part II: Data Science for Economists

Adam Altmejd Selder

Swedish Institute for Social Research (SOFI)

April 23, 2025

Introduction

  • Introduction

  • ggplot intro

  • Plot types

  • Grouping and summarizing

  • Styling

Introduction

Why look at data?

The greatest value of a picture is when it forces us to notice what we never expected to see.

John Tukey

Tufte (1983)

Introduction

Why look at data?

  • In 1854, a horrible cholera epidemic was ravaging London
  • Doctors believed “miasma” (bad air) had caused the disease
  • John Snow (“father of epidemiology”) argued it was transmitted through water
  • But it wasn’t the study’s identification strategy that convinced people; it was its visualization

Snow (1855)

Introduction

Why look at data?

 

Snow (1855)

Introduction

Anscombe’s quartet

library(data.table)
library(modelsummary)
dt = as.data.table(datasets::anscombe)
datasummary(All(dt) ~ N + mean + SD, data = dt)
     N  mean    SD
x1  11  9.00  3.32
x2  11  9.00  3.32
x3  11  9.00  3.32
x4  11  9.00  3.32
y1  11  7.50  2.03
y2  11  7.50  2.03
y3  11  7.50  2.03
y4  11  7.50  2.03

Anscombe (1973)

Introduction

Anscombe’s quartet

Anscombe (1973)
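
The quartet can be plotted with a few lines of ggplot2. This is a minimal sketch, not the code behind the slide: it assumes data.table and ggplot2 are installed, and the long-format column names (set, x, y) are chosen here for illustration.

```r
library(data.table)
library(ggplot2)

# Reshape the wide anscombe data (x1..x4, y1..y4) into long format:
# one row per observation, with `set` identifying the dataset (1 to 4).
dt <- as.data.table(datasets::anscombe)
long <- melt(dt,
             measure.vars = patterns("^x", "^y"),
             variable.name = "set",
             value.name = c("x", "y"))

# One panel per dataset, each with its own fitted regression line
quartet_plot <- ggplot(long, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(vars(set))
quartet_plot
```

The four panels share nearly identical fitted lines even though the point clouds look nothing alike.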

Introduction

Why look at data?

  • Summary statistics are not enough! They can hide critical patterns and differences in data.
  • Visualization helps us:
    • Explore for patterns, trends, outliers, and relationships
    • Understand complex datasets more intuitively
    • Analyze insights missed by numerical methods
    • Evaluate models (e.g., residual plots)
    • Communicate findings clearly and effectively

Introduction

Summary statistics hide patterns

Introduction

Summary statistics hide patterns (cont.)

 

Matejka and Fitzmaurice (2017)

Introduction

Wrangling ↔︎ Visualization

 
  • Exploring data visually is often the best way to understand it and to discover issues

Wickham, Cetinkaya-Rundel, and Grolemund (2023)

Introduction

Data verification

  • Some of your data will be wrong
  • Finding out what is wrong, and how, early on saves a lot of time and energy
    • You don’t want to realize halfway through a project that an important category has been coded as missing.

Verification tasks:

  • Browse data (using View() or just print it)
  • Check descriptives: missing, unique values, mean/median/min/max
  • Plot data: scatters, histograms, densities
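
The descriptive checks can be sketched in data.table. The survey table below is invented for illustration (column names and values are assumptions, not course data).

```r
library(data.table)

# Hypothetical survey extract, invented for illustration
survey <- data.table(
  id     = 1:6,
  wage   = c(25000, 0, NA, 31000, 31000, 1200000),
  sector = c("public", "private", "private", NA, "public", "private")
)

# Number of missing values per column
n_missing <- survey[, lapply(.SD, function(col) sum(is.na(col)))]

# Number of distinct values per column (NA counts as a value here)
n_unique <- survey[, lapply(.SD, uniqueN)]

# Location and range of a numeric variable, ignoring missing values
wage_summary <- survey[, .(mean   = mean(wage, na.rm = TRUE),
                           median = median(wage, na.rm = TRUE),
                           min    = min(wage, na.rm = TRUE),
                           max    = max(wage, na.rm = TRUE))]
print(n_missing)
print(n_unique)
print(wage_summary)
```

A mean far above the median, as here, is a hint to look for outliers or coding errors before going further.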

Introduction

Internal consistency: Is the data represented correctly?

  • Potential sources of problems:
    • Incomplete or duplicated data
    • Missing or incorrectly coded values
    • Encoding problems

Verifying the variable wage, we might ask:

  • Do the values make sense?
  • Is there bunching at high-frequency values?
  • Are zeros and missing coded separately?
  • Does everyone classified as not working have 0 wage?
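
Some of these questions translate directly into code. A minimal sketch on invented data: the columns employed and wage, and all values, are assumptions made for illustration.

```r
library(data.table)

# Hypothetical data, invented for illustration
workers <- data.table(
  employed = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  wage     = c(28000, 30000, 0, NA, 1500, 30000)
)

# Are zeros and missing values coded separately?
coding <- workers[, .(n_zero    = sum(wage == 0, na.rm = TRUE),
                      n_missing = sum(is.na(wage)))]

# Does everyone classified as not working have a wage of 0?
# Rows returned here break that rule (missing or non-zero wage).
inconsistent <- workers[employed == FALSE & (is.na(wage) | wage != 0)]

# Bunching: which exact values occur most often?
top_values <- workers[!is.na(wage), .N, by = wage][order(-N)]
print(coding)
print(inconsistent)
print(top_values)
```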

Introduction

External consistency: Does the data represent what it is supposed to?

  • Potential sources of problems:
    • Bad survey questions
    • Measurement error
    • Sampling bias

Verifying the variable wage, we might ask:

  • Are any government transfers included?
  • How do population means compare to official statistics?
  • Do correlations with related variables make sense?

Introduction

Visualization is also communication

  • Figures are often the most effective way to communicate results
  • Much of what you learn will be just as useful for communication

Introduction

Tufte: Graphical excellence

[Graphical excellence] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space

Edward Tufte

Introduction

Tufte: Graphical excellence

 

Tufte (1983)

Introduction

Bad graphs: Pandemic TV edition

 

Introduction

Let’s plot the FOHM data ourselves

library(ggplot2)
library(hrbrthemes)  # provides theme_ipsum()

fohm_dt <- fread("fohm_c19_death_data.csv")[!is.na(date)]
# Keep only the numbers as published on 2020-04-13
ggplot(data = fohm_dt[publication_date == "2020-04-13"],
       aes(x = date, y = N)) +
  geom_col() +  # same as geom_bar(stat = "identity")
  scale_x_date(limits = as.Date(c("2020-03-01", "2020-04-30"))) +
  geom_vline(xintercept = as.Date("2020-04-08")) +
  geom_hline(yintercept = 70) +
  ylim(0, 120) +
  theme_ipsum()

Introduction

What happened then?

library(gganimate)  # provides transition_time() and ease_aes()

# One animation frame per publication date
ggplot(data = fohm_dt[publication_date %between% list("2020-04-13", "2020-05-15")],
       aes(x = date, y = N, group = publication_date)) +
  geom_col() +  # same as geom_bar(stat = "identity")
  scale_x_date(limits = as.Date(c("2020-03-01", "2020-04-30"))) +
  transition_time(publication_date) +
  ease_aes('linear') +
  geom_vline(xintercept = as.Date("2020-04-08")) +
  geom_hline(yintercept = 70) +
  ylim(0, 120) +
  theme_ipsum()

Introduction

Bad graphs 2: New York Times (2016)

 

Healy (2018)

Introduction

Good visualization leverages human perception

  • We are good at comparing:
    • Position along a common scale
    • Length
  • We are less accurate at judging:
    • Angle
    • Area/volume
    • Color intensity/shade (relative comparisons dominate)
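
To see the difference, draw the same numbers once as bars (position along a common scale) and once as a pie (angle). This is a small sketch with made-up shares; the data frame and its values are invented for illustration.

```r
library(ggplot2)

# Made-up shares that are deliberately close to each other
shares <- data.frame(
  group = c("A", "B", "C", "D", "E"),
  share = c(23, 21, 20, 19, 17)
)

# Position along a common scale: the ranking is easy to read
bars <- ggplot(shares, aes(x = group, y = share)) +
  geom_col()

# Angle: the same data as a pie chart (a stacked bar in polar coordinates)
pie <- ggplot(shares, aes(x = "", y = share, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

bars
pie
```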

Introduction

Perception: examples

 

Healy (2018)

Introduction

Perception: examples

 

Healy (2018)

ggplot intro

Grammar of graphics

  • We will learn how to make plots in R using the popular ggplot2 package
  • ggplot2 implements the “grammar of graphics” graph-building paradigm
  • Builds plots layer by layer, adding geometries (“geoms”)

ggplot intro

A basic ggplot template

ggplot(data = <DATA_FRAME>,
       mapping = aes(<MAPPINGS>)) +
  <GEOM_FUNCTION>() +
  # Add more layers (optional)
  <SCALE_FUNCTION>() +
  <THEME_FUNCTION>() +
  <LABS_FUNCTION>()

ggplot intro

Using the gapminder data

library(gapminder)
library(tinytable)  # provides tt()
# Show 10 randomly sampled rows
tt(dplyr::slice_sample(gapminder, n = 10))
country        continent  year  lifeExp       pop    gdpPercap
Uruguay        Americas   1957   67.044   2424959    6150.7730
Philippines    Asia       1972   58.065  40850141    1989.3741
Burkina Faso   Africa     1987   49.557   7586551     912.0631
Kuwait         Asia       1972   67.712    841934  109347.8670
Eritrea        Africa     1982   43.890   2637297     524.8758
Morocco        Africa     1992   65.393  25798239    2948.0473
Sudan          Africa     1982   50.338  20367053    1895.5441
Romania        Europe     2002   71.322  22404337    7885.3601
Senegal        Africa     1977   48.879   5260855    1561.7691
Guinea-Bissau  Africa     1987   41.245    927524     736.4154

ggplot intro

Creating the ggplot object

ggplot(data = gapminder)

ggplot intro

Creating the ggplot object (cont.)

p <- ggplot(data = gapminder,
            mapping = aes(x = continent,
                          y = lifeExp))
p

We set the aesthetic mapping of the ggplot object to columns of the gapminder data frame.

ggplot intro

Adding a layer

p + geom_point()

To draw something on the canvas, we need to add a geometry layer, for example a scatter plot with geom_point().

ggplot intro

Adding a layer (cont.)

p + layer(
  mapping = NULL,
  data = NULL,
  geom = "point",
  stat = "identity",
  position = "identity"
)

geom_point() is a shortcut for layer(...). Setting mapping and data to NULL means they are inherited from p.

ggplot intro

Adding a boxplot

p + geom_boxplot()

Let’s use a boxplot instead. It shows the distribution of a continuous variable across multiple groups.

ggplot intro

Adding another geom

p +
  geom_boxplot() +
  geom_jitter()

To get a more visual sense of where the data is located we can re-add the actual data points.

ggplot intro

Styling geoms

p +
  geom_boxplot(outlier.color = "red") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0),
              alpha = 0.25)

Highlighting outliers and making the points less prominent. alpha sets opacity: 0 is fully transparent, 1 is fully opaque.

ggplot intro

Styling scales and labels

p +
  geom_boxplot(outlier.color = "red") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0),
              alpha = 0.25) +
  scale_y_continuous(n.breaks = 5,
                     limits = c(0, 100),
                     expand = expansion(c(0,0.05))) +
  labs(y = "Life expectancy (years)",
       x = "Continent") +
  theme_bw()

Starting the y-axis at zero is usually a good idea. Here, I also added a simple theme and formatted the axis labels.

ggplot intro

Adding data labels to outliers

library(ggrepel)
# Flag values outside the boxplot whiskers (1.5 * IQR beyond the quartiles)
is_outlier = function(x) {
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

# Flag outliers within each continent; the `_[...]` placeholder needs R >= 4.3
plotdata = as.data.table(gapminder) |>
    _[, outlier := is_outlier(lifeExp), by = "continent"]

p +
  geom_boxplot(outlier.color = "red") +
  # Jitter only the non-outliers; the boxplot already draws the outliers
  geom_jitter(data = plotdata[outlier == FALSE],
              position = position_jitter(width = 0.1, height = 0),
              alpha = 0.25) +
  scale_y_continuous(n.breaks = 5,
                     limits = c(0, 100),
                     expand = expansion(c(0,0.05))) +
  labs(y = "Life expectancy (years)",
       x = "Continent") +
  theme_bw() +
  # Label each outlying country once
  geom_text_repel(data = unique(plotdata[outlier == TRUE], by = "country"),
                  aes(label = country))

ggplot intro

A truly powerful plotting package

  • What if we want
    • To plot the relationship between GDP and life expectancy
    • Divided by countries and continents
    • To see how countries have developed over time
    • To take population size into account
  • 6 dimensions 🤯
  • Hans Rosling managed!

ggplot intro

Rosling’s famous animated plot

library(gganimate)
library(hrbrthemes)
ggplot(plotdata[continent != "Oceania"],
       aes(x = gdpPercap, y = lifeExp, size = pop, color = country)) +
  geom_point(alpha = 0.6, show.legend = FALSE) +
  scale_color_manual(values = country_colors) +
  scale_size(range = c(1, 10)) +
  scale_x_log10(limits = c(150, 115000),
                labels = scales::comma) +
  facet_wrap(vars(continent)) +
  theme_ipsum() +
  coord_fixed(ratio = 1 / 43) +
  labs(title = 'Year: {frame_time}',
       x = 'GDP per capita', y = 'Life expectancy') +
  transition_time(year) +
  ease_aes('linear')

ggplot intro

Rosling’s famous animated plot

Plot types

Common plots

  • Scatter Plots (geom_point): relationships
  • Line Charts (geom_line): time series
  • Bar Charts (geom_col): comparisons
  • Histograms & Density Plots (geom_histogram, geom_density): distributions
  • Box Plots (geom_boxplot): grouped distributions
  • Statistical summaries (geom_smooth, geom_errorbar): presenting results

Plot types

An example dataset

Let’s start by working with a really simple dataset with two groups:

df <- data.frame(
  x = c(1, 3, 4, 10, 8, 9, 3, 1, 5, 2),
  y = c(2, 6, 1, -4, 5, 1, 2, 3, 0, 4),
  gr = c(rep("a", 5), rep("b", 5))
)
df
    x  y gr
1   1  2  a
2   3  6  a
3   4  1  a
4  10 -4  a
5   8  5  a
6   9  1  b
7   3  2  b
8   1  3  b
9   5  0  b
10  2  4  b
p <- ggplot(data = df,
            mapping = aes(x=x, y=y))

Plot types

Individual geoms: geom_point()

p + geom_point()

Plot types

Individual geoms: geom_col()

p + geom_col()

This does not look right…

Plot types

Individual geoms: geom_col()

p + geom_col(
  position =
    position_dodge2(preserve = "single")
)

The default is position = "stack", which stacks bars that share an x value on top of each other. Dodging places them side by side instead.

Plot types

Collective geoms

p + geom_line(aes(group = gr))

p + geom_line(aes(linetype = gr))

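Collective geoms draw connected observations. The group aesthetic tells ggplot which observations to connect. Mapping a discrete variable to another aesthetic, such as linetype, groups the data implicitly.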

Plot types

Statistical summaries: geom_histogram()

ggplot(data=df, aes(x=x)) +
  geom_histogram()
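
Without arguments, geom_histogram() uses 30 bins and prints a message asking you to pick a better value. A minimal sketch with an explicit bin width follows; the values binwidth = 2 and boundary = 0 are arbitrary choices for this small dataset.

```r
library(ggplot2)

# The x values from the example dataset
df <- data.frame(x = c(1, 3, 4, 10, 8, 9, 3, 1, 5, 2))

# Bins of width 2 with edges at 0, 2, 4, ...
hist_plot <- ggplot(df, aes(x = x)) +
  geom_histogram(binwidth = 2, boundary = 0)
hist_plot

# layer_data() returns the binned counts that ggplot computed for the layer
binned <- layer_data(hist_plot)
binned[, c("xmin", "xmax", "count")]
```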

Plot types

Statistical summaries: geom_smooth()

p + geom_point() + geom_smooth(method = "lm")

geom_smooth() adds a fitted line with a confidence band. Here, method = "lm" gives a linear fit; by default, small datasets get a loess smoother.

Plot types

Statistical summaries: geom_errorbar()

p + geom_errorbar(aes(ymin = y - 1, ymax = y + 1))

Useful for reporting e.g., coefficient plots, but requires ymin and ymax aesthetics.
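
A coefficient plot could look like the sketch below. The regression is an illustrative choice, not one from the lecture, and the column names in coefs are assumptions.

```r
library(ggplot2)
library(gapminder)

# Illustrative regression (not from the lecture)
fit <- lm(lifeExp ~ log(gdpPercap) + continent, data = gapminder)

# Collect point estimates and 95% confidence intervals
ci <- confint(fit)
coefs <- data.frame(
  term     = names(coef(fit)),
  estimate = unname(coef(fit)),
  ymin     = ci[, 1],
  ymax     = ci[, 2]
)
coefs <- coefs[coefs$term != "(Intercept)", ]  # drop the intercept

# Points for the estimates, error bars for the intervals
coef_plot <- ggplot(coefs, aes(x = term, y = estimate)) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_point() +
  geom_errorbar(aes(ymin = ymin, ymax = ymax), width = 0.2) +
  coord_flip()
coef_plot
```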

Grouping and summarizing

What if we wanted to plot how GDP per capita has developed for each country over time? A line plot should do this well.

ggplot(
  gapminder,
  aes(x = year,
      y = gdpPercap)
) +
  geom_line()

Any idea what’s wrong?

Grouping and summarizing

The group aesthetic

We need to tell ggplot to group the data by country.

ggplot(
  gapminder,
  aes(x = year,
      y = gdpPercap,
      group = country)
) +
  geom_line()

Still quite hard to see what’s going on!

Grouping and summarizing

Making things clearer

Let’s color lines by continent.

ggplot(
  gapminder,
  aes(x = year,
      y = gdpPercap,
      group = country,
      color = continent)
) +
  geom_line()

Still looks cluttered.

Grouping and summarizing

Faceting

Instead we could split the plot into subplots by continent.

ggplot(
  gapminder,
  aes(x = year,
      y = gdpPercap,
      group = country)
) +
  geom_line() +
  facet_wrap(vars(continent))

By default facet_wrap() keeps y axes the same. See ?facet_wrap for how to change this.
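
For instance, the scales argument of facet_wrap() lets each panel have its own y axis. A minimal sketch of the same plot with free y scales:

```r
library(ggplot2)
library(gapminder)

# Same plot as above, but each continent gets its own y-axis range
free_plot <- ggplot(gapminder,
                    aes(x = year, y = gdpPercap, group = country)) +
  geom_line() +
  facet_wrap(vars(continent), scales = "free_y")
free_plot
```

Free scales make within-panel trends easier to see, at the cost of making comparisons across panels harder.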

Styling

A plot to work on

p <- ggplot(
  gapminder,
  aes(x = gdpPercap,
      y = lifeExp)
)
p + geom_point()

How can we increase readability?

Styling

Configuring scales: color

ggplot(gapminder,
       aes(x = gdpPercap,
           y = lifeExp)) +
  geom_point(aes(color = continent))

Styling

Configuring scales: size

ggplot(gapminder,
       aes(x = gdpPercap,
           y = lifeExp)) +
  geom_point(aes(color = continent,
                 size = pop),
             shape = 1, alpha = 0.75) +
  scale_size(labels = scales::comma)

  • Point size now varies with population size
  • Points are drawn as semi-transparent hollow circles (shape = 1)
  • The size legend uses comma-separated numbers instead of scientific notation

Styling

Configuring scales: logarithmic x-scale

ggplot(gapminder,
       aes(x = gdpPercap,
           y = lifeExp)) +
  geom_point(aes(color = continent,
                 size = pop),
             shape = 1, alpha = 0.75) +
  scale_size(labels = scales::comma) +
  scale_x_log10(labels = scales::dollar)

Styling

Adding (population-weighted) regression lines

p <- ggplot(gapminder,
       aes(x = gdpPercap,
           y = lifeExp,
           color = continent)) +
  geom_point(aes(size = pop),
             shape = 1, alpha = 0.75) +
  scale_size(labels = scales::comma) +
  scale_x_log10(labels = scales::dollar) +
  geom_smooth(aes(weight = pop),
              linewidth = 0.8,
              method = "lm", se = FALSE)
p

Styling

Adding plot labels

p <- p +
  labs(
    x = "GDP per capita (log scale)",
    y = "Life expectancy",
    color = "Continent",
    size = "Population size",
    title = "Prosperity brings health, or is it the other way around?",
    subtitle = "1952-2007",
    caption = "Data from Gapminder"
  )
p

Styling

theme() sets look and feel

p + theme_bw() +
  theme(plot.title = element_text(size=16))

p + hrbrthemes::theme_ipsum() +
  theme(legend.position = "bottom",
        legend.box = "vertical")

Styling

Aside on colors

  • We do not perceive all colors the same.
  • When plotting, try to use palettes designed for perceptual uniformity.
  • Three types of palettes, depending on data structure:
    • Sequential: for ordered data (e.g., income)
    • Diverging: ordered with midpoint (correlation, temperature)
    • Qualitative: unordered, categorical data (countries, species)
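
The three palette types map onto different ggplot2 scale functions. This is a sketch using the 2007 gapminder data; the specific palettes, and the diverging variable (life expectancy relative to the mean), are choices made here for illustration.

```r
library(ggplot2)
library(gapminder)

gm2007 <- subset(gapminder, year == 2007)
base <- ggplot(gm2007, aes(x = gdpPercap, y = lifeExp)) +
  scale_x_log10()

# Qualitative: unordered categories (continent)
qualitative <- base +
  geom_point(aes(color = continent)) +
  scale_color_brewer(palette = "Dark2")

# Sequential: ordered values (log population)
sequential <- base +
  geom_point(aes(color = log10(pop))) +
  scale_color_viridis_c()

# Diverging: ordered values with a meaningful midpoint (deviation from mean)
diverging <- base +
  geom_point(aes(color = lifeExp - mean(lifeExp))) +
  scale_color_gradient2(midpoint = 0)
```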

Styling

Aside on colors: color blindness

8% of men and 0.5% of women have some form of color blindness. Let’s create a function to evaluate how different palettes look for people who are color blind.

library(dichromat)
library(paletteer)
# Show a palette as seen with normal vision and with three types of color blindness
colorblind_palette = function(palette) {
  melt(as.data.table(
    append(
      list(x = seq_along(palette),
           "Trichromacy (Original)" = palette),
      # dichromat() simulates each deficiency; each list element is passed as `type`
      lapply(c("Protanopia"="protan", "Deuteranopia"="deutan", "Tritanopia"="tritan"),
             dichromat, colours = palette)
    )
  ), id.vars = "x") |>
    # One row of color tiles per vision type
    ggplot(aes(x=x, y = 0, fill = value)) +
    geom_raster() + facet_wrap(vars(variable), nrow=4) +
    scale_fill_identity() + theme_void() + theme(legend.position = "none")
}

Styling

Aside on colors: color blindness (ggplot2 defaults)

colorblind_palette(scales::hue_pal()(5))

Styling

Aside on colors: color blindness

colorblind_palette(paletteer_d("RColorBrewer::RdYlGn", 10))

Styling

Aside on colors: color blindness

colorblind_palette(paletteer_d("wesanderson::Darjeeling2", 5))

Let’s go with this one for our plot!

Styling

Final result
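
The code for the final figure is not shown on the slide. The sketch below is one way to combine the earlier steps with the chosen palette; the choice of theme_bw() and the legend placement are assumptions.

```r
library(ggplot2)
library(gapminder)
library(paletteer)

# Five colors from the chosen palette, one per continent
pal <- as.character(paletteer_d("wesanderson::Darjeeling2", 5))

final_plot <- ggplot(gapminder,
                     aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(size = pop), shape = 1, alpha = 0.75) +
  geom_smooth(aes(weight = pop), method = "lm", formula = y ~ x,
              se = FALSE, linewidth = 0.8) +
  scale_size(labels = scales::comma) +
  scale_x_log10(labels = scales::dollar) +
  scale_color_manual(values = pal) +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy",
       color = "Continent",
       size = "Population size",
       title = "Prosperity brings health, or is it the other way around?",
       subtitle = "1952-2007",
       caption = "Data from Gapminder") +
  theme_bw() +
  theme(legend.position = "bottom", legend.box = "vertical")
final_plot
```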

Styling

Saving figures with ggsave()

  • It sounds easy, but saving figures can be quite messy, even if the plot looked really good in the viewer.
  • ggsave(<filename>, p) saves the p plot object to a file
  • The file ending determines the graphics device:
    • Vector formats (.pdf, .svg) stay sharp at any size
    • Raster formats (.png, .jpg) are easier to work with, but need a high enough resolution (dpi)
  • Use arguments width, height, and scale to get the size right
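
A minimal sketch of saving the same plot as a vector and a raster file. The plot, file names, and dimensions are placeholders; files go to a temporary directory so nothing is overwritten.

```r
library(ggplot2)

# Placeholder plot using a built-in dataset
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

pdf_file <- file.path(tempdir(), "scatter.pdf")
png_file <- file.path(tempdir(), "scatter.png")

# Vector output: size in physical units, no resolution needed
ggsave(pdf_file, p, width = 16, height = 10, units = "cm")

# Raster output: same physical size, resolution set via dpi
ggsave(png_file, p, width = 16, height = 10, units = "cm", dpi = 300)
```

Fixing width and height explicitly keeps text and point sizes consistent across figures, instead of depending on the size of the viewer window.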

Styling

Best practices

  • Choose the right plot for your data and question
  • Use labels to explain your plot
  • Keep it clean, don’t plot too much on the same chart
  • Use colors sparingly, effectively, accounting for colorblindness

Styling

Common pitfalls to avoid

  • Misleading axes (bar charts not starting at zero 🫢)
  • Overplotting (too much data)
  • Chartjunk (prominent gridlines, backgrounds)
  • 3D plots, pie charts (hard to read)
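
Two common remedies for overplotting are transparency and binning. This is a sketch on simulated data; the sample size, alpha, and number of bins are arbitrary choices.

```r
library(ggplot2)

# Simulated data, invented for illustration
set.seed(1)
n <- 20000
sim <- data.frame(x = rnorm(n))
sim$y <- sim$x + rnorm(n)

# Remedy 1: transparency, so dense regions appear darker
transparent <- ggplot(sim, aes(x = x, y = y)) +
  geom_point(alpha = 0.05)

# Remedy 2: bin the plane and map counts to fill color
binned <- ggplot(sim, aes(x = x, y = y)) +
  geom_bin_2d(bins = 50)
```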

Next lecture: Data wrangling

Resources

References

Anscombe, F. J. 1973. “Graphs in Statistical Analysis.” The American Statistician 27 (1): 17–21.
Healy, Kieran. 2018. Data Visualization: A Practical Introduction. 1st edition. Princeton University Press.
Matejka, Justin, and George Fitzmaurice. 2017. “Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics Through Simulated Annealing.” In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 1290–94. CHI ’17. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3025453.3025912.
Snow, John. 1855. On the Mode of Communication of Cholera. 2nd ed. John Churchill.
Tufte, Edward R. 1983. The Visual Display of Quantitative Information. 2nd ed. Graphics Press USA.
Wickham, Hadley, Mine Cetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. O’Reilly Media.